
perf(slirp): architectural experiments — io_uring / splice / multi-queue #83

Draft

dpsoft wants to merge 19 commits into main from slirp-perf-architectural-exp

Conversation


dpsoft commented May 7, 2026

Goal

Stacked on top of #81. With #81's heaptrack-driven user-space alloc reductions exhausted (-90% allocs/iter, p50 unchanged at ~275 µs), the remaining wall-clock floor is dominated by kernel ↔ userspace transitions (per-packet read()/write()), per-vCPU MMIO exits, and single-queue serialization through net_poll_thread.

This branch is the playground for architectural experiments that change the syscall / vCPU shape. Full plan: docs/perf-architectural-experiments.md.

Non-goal: TAP / passt-style host bypass

Dropping SLIRP and routing through TAP + an external passt instance would close the latency gap to passt itself, but it would move the DNS interception, port-forwarding, deny-list, and rate-limiting feature surface out of voidbox into a separate process — and we lose the in-process observability we currently get from instrumenting SLIRP directly. Full SLIRP-path observability is a hard requirement, so passt-style bypass is out of scope.

Experiments (ranked by risk × payoff)

1. io_uring for SLIRP host-socket I/O — start here

Replace per-flow recv() + sendto() (one syscall per packet, serial in net_poll_thread) with batched IORING_OP_RECV / IORING_OP_SEND SQEs submitted in a single syscall after each epoll_wait.

Expected: ~10–30 µs CRR p50 reduction. Risk: lowest — localized to the relay layer's read/write helpers.
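
A minimal sketch of the batched shape, using the io-uring crate. The helper name, fd/buffer plumbing, and user_data scheme here are illustrative assumptions, not the final relay wiring:

    use io_uring::{opcode, types, IoUring};

    /// Hypothetical helper: one recv SQE per ready flow, one
    /// io_uring_enter() for the whole batch.
    fn submit_recv_batch(
        ring: &mut IoUring,
        flows: &mut [(i32, Vec<u8>)], // (host socket fd, recv buffer)
    ) -> std::io::Result<usize> {
        for (corr_id, (fd, buf)) in flows.iter_mut().enumerate() {
            // Safety contract: `buf` must outlive the matching CQE,
            // because the kernel reads/writes it asynchronously.
            let sqe = opcode::Recv::new(types::Fd(*fd), buf.as_mut_ptr(), buf.len() as u32)
                .build()
                .user_data(corr_id as u64); // routes the CQE back to its flow
            unsafe { ring.submission().push(&sqe).expect("submission queue full") };
        }
        ring.submit() // single syscall replaces one recv() per flow
    }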

2. splice() / sendfile() zero-copy on bulk paths

splice() between the host-socket fd and a pipe to eliminate the userspace copy on the bulk-relay TX path. Only works fd-to-fd, so applies to payload bytes only (header rewriting stays in smoltcp).

Expected: +10–20% on tcp_throughput_g2h_mbps. Risk: medium — pipe-fd plumbing through the relay state machine.
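
A rough sketch of the fd-to-pipe-to-fd shape via libc::splice; fd names and error handling are placeholder assumptions:

    /// Hypothetical: move up to `len` payload bytes
    /// host socket -> pipe -> destination fd with no userspace copy.
    unsafe fn splice_relay(src: i32, pipe_rd: i32, pipe_wr: i32,
                           dst: i32, len: usize) -> isize {
        let flags = libc::SPLICE_F_MOVE | libc::SPLICE_F_NONBLOCK;
        let n = libc::splice(src, std::ptr::null_mut(),
                             pipe_wr, std::ptr::null_mut(), len, flags);
        if n <= 0 {
            return n; // 0 = EOF, -1 = check errno (EAGAIN is expected)
        }
        libc::splice(pipe_rd, std::ptr::null_mut(),
                     dst, std::ptr::null_mut(), n as usize, flags)
    }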

3. MSI-X virtio + multi-queue for vCPU scaling

Add MSI-X support to src/vmm/arch/x86_64/ and expose VIRTIO_NET_F_MQ so the guest can spin up per-CPU queue pairs. Host fans out queues to multiple poll threads.

Expected: +50–100% throughput on multi-vCPU sandboxes. Risk: highest — touches IRQ delivery, KVM_IRQFD wiring, and is HW-feature-gated.
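
For orientation, the guest-visible side is roughly the following. The feature bit is from the virtio spec; the config struct is a sketch, not voidbox's actual device code:

    // virtio spec: feature bit 22 advertises multi-queue support.
    const VIRTIO_NET_F_MQ: u64 = 1 << 22;

    // The device config space grows a max_virtqueue_pairs field
    // (little-endian u16 after mac[6] + status); the guest spins up
    // to that many RX/TX queue pairs, typically one per vCPU.
    #[repr(C, packed)]
    struct VirtioNetConfig {
        mac: [u8; 6],
        status: u16,
        max_virtqueue_pairs: u16,
    }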

Tooling

Uses the perf-harness from #81:

  • examples/crr_singleproc_bench — single-process CRR latency (real NAT path).
  • voidbox-network-bench — g2h throughput, RR p50/p99.
  • heaptrack — alloc regression check.
  • tools/perf-harness/bench-pasta.py — pasta reference number.
  • tools/perf-harness/bench-qemu-slirp.sh — qemu+libslirp / qemu+passt cross-check.

Methodology

  1. Each experiment lands as its own commit gated behind a Cargo feature (io-uring, splice-zerocopy, multi-queue) so the #81 baseline (the passt/pasta head-to-head comparison harness) can A/B against it without a revert.
  2. Commit message includes before/after from crr_singleproc_bench --iterations 100 and voidbox-network-bench --iterations 3.
  3. heaptrack after each commit confirms no alloc regression vs round-2 numbers (~41 allocs/iter).
  4. If a commit doesn't move the needle, it's reverted before the next experiment so the diff stays minimal.

Test plan

  • io_uring POC builds + tests pass
  • CRR microbench shows measurable p50 improvement vs the #81 tip
  • No allocation regression vs round-2 numbers
  • (later) splice POC if io_uring wins are smaller than expected
  • (later) MSI-X / multi-queue if single-vCPU floor is hit

dpsoft added 17 commits May 6, 2026 18:30
Two scripts and a doc, deferred deliverable from
docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md
§ "passt head-to-head methodology".

scripts/bench-pasta.py
  Drives the same workload shape as voidbox-network-bench (g2h
  throughput, RR p50/p99, CRR p50) against pasta running in a
  network namespace.  Outputs JSON in the same Report shape so
  bench-compare-pasta.py can diff the two side by side.

  pasta is launched with --config-net + --map-host-loopback
  (default: gateway IP) so connecting to the host gateway from
  inside the netns reaches the host's 127.0.0.1.  Mirrors
  voidbox's SLIRP convention (10.0.2.2 → 127.0.0.1) closely
  enough for the apples-to-apples CRR metric.

scripts/bench-compare-pasta.py
  Reads two JSONs and emits a markdown side-by-side.  Auto-detects
  which file is which via the `backend` field.  Reports the gap
  as 'voidbox N× faster/slower' so the direction is unambiguous.

docs/passt-comparison.md
  Caveats + usage.  Calls out that throughput numbers are NOT
  directly comparable (voidbox has VM/MMIO overhead pasta does
  not).  CRR latency is the apples-to-apples metric: dominated by
  NAT-table operations on both sides.

Tested locally: pasta CRR p50 ≈ 80 µs, voidbox CRR p50 ≈ 10.1 ms
on the same host. The gap is dominated by voidbox's poll-thread
cadence + virtio-mmio exits, not NAT-table cost — an actionable
signal for follow-up perf work.

Pair of artefacts used to root-cause the apparent 122x voidbox-vs-pasta
CRR p50 gap reported by scripts/bench-pasta.py.

tools/crr-client.c
  Static-linked C binary that performs N TCP CRRs in one process,
  no fork or exec per iteration.  Output is one line of nanoseconds:
  N P50 P99 MEAN.  Compile with:

    gcc -O2 -static -o /tmp/crr-client tools/crr-client.c

examples/crr_singleproc_bench.rs
  Voidbox-side driver.  Boots a sandbox with /tmp host-mounted into
  the guest, runs the static binary inside the guest, parses the
  one-line output.  Measures voidbox's NAT-path CRR cost without the
  outer bench's per-iteration nc fork+exec.

Result: voidbox-in-VM at 421 us p50 vs pasta-in-netns at 107 us p50
is dominated (~300 us of the ~314 us gap) by VM transit (virtio-mmio
exits, KVM IRQ injection, vsock RPC), not by SLIRP-engine cost.
A genuinely apples-to-apples SLIRP-vs-SLIRP comparison (passt+qemu
vs voidbox+voidbox-VM) is the natural follow-up; this commit captures
the tooling so that follow-up can stand on a reproducible baseline.

Boots a minimal qemu guest carrying tools/crr-client and runs N TCP
CRRs against a host TCP server.  Two backends:

  --backend libslirp    qemu's built-in -netdev user (libslirp)
  --backend passt       qemu -netdev stream + passt(1) over UNIX socket

Same workload + iteration count as scripts/bench-pasta.py and
examples/crr_singleproc_bench.rs, so the five datapoints (host-direct,
pasta-in-netns, qemu+libslirp, qemu+passt, voidbox+voidbox-SLIRP)
are directly comparable on the same machine.

The script auto-builds the initramfs from tools/qemu-init.sh +
busybox + tools/crr-client, including virtio_net + failover modules
from the host kernel so a stock distro kernel can probe the qemu
virtio-net-pci device.  Voidbox's slim kernel has them built-in and
the insmod calls fail harmlessly.

Result on the dev machine:

  host-direct                63 us p50
  pasta (netns, no VM)      107 us p50
  qemu+libslirp (in VM)     181 us p50
  qemu+passt (in VM)        163 us p50
  voidbox+voidbox-SLIRP     421 us p50

Voidbox is ~2.2x slower than the mature C SLIRPs in the same
VM-attached configuration -- the genuine engine gap, independent of
the fork artefact (10x) and the VM transit (which both sides pay).

Four small wins on the per-packet path between the SlirpBackend's
inject queue and the guest, identified by the SLIRP-vs-SLIRP
comparison (voidbox 421 us p50 vs qemu+passt 163 us p50 on the
single-process TCP CRR benchmark).

src/devices/virtio_net.rs::try_inject_rx
  - Read avail.idx ONCE per call instead of per frame.  The driver
    only bumps it when adding new buffers; per-frame re-reads are
    redundant guest-memory accesses.
  - Replace 'let used_elem = [...].concat()' with a stack [u8; 8]
    (sketched below).  The previous code allocated a Vec<u8> per
    injected frame in the hot path; the new code costs two four-byte
    copies and zero allocs.
  - Write used.idx ONCE at the end of the batch rather than after
    every frame.  The virtio spec only requires a single update per
    publish; per-frame writes were redundant guest-memory accesses.
  - Return frames_injected (usize) so callers can pulse the IRQ
    line conditionally on actual new RX work.
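
The used-ring element is just two little-endian u32s (id, len), so
the stack form is simply (a sketch; the address math and `mem.write`
helper stand in for the real guest-memory API):

    // virtq used element = { le32 id; le32 len } -- 8 bytes, no heap
    let mut used_elem = [0u8; 8];
    used_elem[..4].copy_from_slice(&(desc_head as u32).to_le_bytes());
    used_elem[4..].copy_from_slice(&(written_len as u32).to_le_bytes());
    // ring header (le16 flags + le16 idx) is 4 bytes; entries are 8
    let slot = used_ring_addr + 4 + 8 * u64::from(used_idx % queue_size);
    mem.write(&used_elem, slot);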

src/devices/virtio_net.rs::process_tx_queue
  - Replace per-frame Vec::concat with stack [u8; 8] (same fix as
    the RX path).
  - Read each TX descriptor segment directly into the packet buffer
    via packet.resize() + mem.read(&mut packet[off..]) instead of
    allocating an intermediate Vec<u8> and extend_from_slice'ing.
    Saves one allocation and one full memcpy per descriptor segment.
  - Reuse a single Vec<u8> packet buffer with capacity 1600 across
    all frames in the call instead of allocating fresh per frame.
  - Batch used.idx update at end of the batch (same as RX).

src/vmm/mod.rs::net_poll_thread
  - Track previous-cycle pending state.  Pulse KVM_IRQ_LINE only
    when (a) we actually injected new RX frames this cycle OR (b)
    interrupt_status went from clear -> pending across cycles.
    Previously the loop pulsed twice (assert level=1, then deassert
    level=0) on every cycle while interrupt_status was non-zero,
    even when the guest hadn't acked the previous pulse and no new
    work had arrived.  Skipping the pulse pair when there's nothing
    new saves two ioctl(KVM_IRQ_LINE) calls per redundant cycle
    (~5-10 us each on the CRR hot path).
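
In shape, the gating condition is roughly (names illustrative):

    let pending_now = interrupt_status != 0;
    let newly_pending = pending_now && !was_pending_last_cycle;
    if frames_injected > 0 || newly_pending {
        pulse_irq(); // assert+deassert pair only when there is new work
    }
    was_pending_last_cycle = pending_now;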

Effect on the single-process CRR p50 (mean of 5 runs of 30
iterations each, voidbox+voidbox-SLIRP):

  before: 421 us   p50 mean
  after:  380 us   p50 mean   (~10% improvement)

The IRQ pulse change is the dominant contributor; the RX/TX heap
allocation removals are correct cleanup but contribute below
sample variance.  Voidbox's gap to qemu+passt (163 us) shrinks
from 2.6x to 2.3x; remaining gap candidates are MMIO exit cost,
KVM_IRQ_LINE vs irqfd, and SlirpBackend lock contention.

The voidbox net-poll thread was raising IRQ 10 with two
ioctl(KVM_IRQ_LINE) calls per pulse: assert level=1, then deassert
level=0.  Each ioctl is a syscall (a few µs each on KVM); on the
TCP CRR hot path with multiple IRQ deliveries per connection, the
ioctl pair became a measurable share of per-iteration cost.

Replace with KVM_IRQFD: one eventfd registered with the in-kernel
irqchip via vm_fd().register_irqfd(&eventfd, 10) at thread startup.
Pulsing the IRQ is now a single 8-byte write to the eventfd; the
kernel asserts the IRQ line directly without a userspace round-trip
through ioctl().

The legacy KVM_IRQ_LINE path is kept as a fallback when irqfd
registration fails (kernel without irqfd support, irqchip routing
not initialised).  In normal operation the eventfd succeeds at
startup and the legacy ioctls never run.
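
The registration-plus-pulse shape, per the kvm-ioctls and
vmm-sys-util APIs (error handling elided; GSI 10 from this commit):

    use vmm_sys_util::eventfd::{EventFd, EFD_NONBLOCK};

    // once, at net_poll_thread startup
    let irqfd = EventFd::new(EFD_NONBLOCK).unwrap();
    vm_fd.register_irqfd(&irqfd, 10).unwrap();

    // hot path: a single 8-byte write replaces the
    // assert/deassert ioctl(KVM_IRQ_LINE) pair
    irqfd.write(1).unwrap();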

Effect on the single-process CRR p50 (mean over 5 runs of 30
iterations, voidbox+voidbox-SLIRP):

  before this commit:  ~380 us p50
  after this commit:   ~335 us p50   (~12% reduction)

Cumulative with the previous virtio-net hot-path cleanups:

  baseline:           421 us p50
  after all fixes:    ~335 us p50    (~20% cumulative reduction)

Voidbox's gap to qemu+passt (163 us) shrinks from 2.6x to 2.0x.

Without ioeventfd, every guest TX (write to QUEUE_NOTIFY MMIO with
value=1) forces a KVM_RUN exit: vCPU thread dispatches into virtio-net's
write_mmio handler, calls process_tx_queue, then re-enters KVM_RUN.
On the TCP CRR hot path with multiple TX per connection that's a few
microseconds of pure VM-exit overhead per packet on top of the actual
network work.

Register the eventfd at MMIO addr 0xd000_0050 with datamatch=1 (TX
queue notify only).  Now KVM consumes the matching MMIO write
in-kernel and signals the eventfd; vCPU continues running uninterrupted.
Net-poll thread sees the eventfd alongside flow events on the existing
EpollDispatch (under a token in a tag space that doesn't collide with
PROTO_TAG_*), drains it, and calls process_tx_queue on its own
schedule.

Notifies for queue 0 (RX, value=0) still take the slow path through
the MMIO write handler — they're rare (only when guest adds new RX
buffers) so the optimisation isn't needed there.

Falls back to the synchronous MMIO-exit path if eventfd creation or
KVM_IOEVENTFD registration fails.
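
The registration shape (kvm-ioctls again; the address and datamatch
values are the ones given above):

    use kvm_ioctls::IoEventAddress;
    use vmm_sys_util::eventfd::{EventFd, EFD_NONBLOCK};

    let tx_notify = EventFd::new(EFD_NONBLOCK).unwrap();
    // datamatch = 1: only QUEUE_NOTIFY writes of value 1 (TX queue)
    // are consumed in-kernel; value 0 (RX) still takes the MMIO exit.
    vm_fd
        .register_ioevent(&tx_notify, &IoEventAddress::Mmio(0xd000_0050), 1u32)
        .unwrap();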

Effect on the single-process CRR p50 (mean over 5 runs of 30
iterations, voidbox+voidbox-SLIRP):

  before this commit:    ~335 us p50
  after this commit:     ~278 us p50   (~17% reduction)

Cumulative across the recent perf series:

  baseline:              421 us p50
  + virtio-net cleanups: ~380 us p50
  + KVM_IRQFD:           ~335 us p50
  + KVM_IOEVENTFD:       ~278 us p50   (~34% cumulative)

Voidbox's gap to qemu+passt (163 us) shrinks from 2.6x to 1.7x.

Restructures the host->guest RX path to eliminate the
Arc<Mutex<VirtioNetDevice>> contention between the net-poll thread
and the vCPU thread.  Inspired by the user-suggested Option B:
"net-poll -> rx_queue[vCPU] -> that vCPU consumes".

Before:
  net-poll thread:
    let mut g = net_dev.lock();          // takes device mutex
    g.try_inject_rx(mem);                // descriptor walk + writes
    drop(g);
    pulse_irq();
  vCPU thread on MMIO exit:
    let g = net_dev.lock();              // waits for net-poll
    g.mmio_read(...);

After:
  net-poll thread:
    drain backend frames into a Vec;     // backend mutex only
    push each frame to pending_rx;       // lock-free SegQueue
    pulse_irq();                         // never touches device mutex
  vCPU thread on MMIO exit:
    let mut g = net_dev.lock();          // uncontended now
    g.flush_pending_rx(mem);             // descriptor writes here
    g.mmio_read/mmio_write(...);

Net-poll's hot path no longer holds the VirtioNetDevice mutex at
all -- it only acquires the SLIRP backend Arc independently.  vCPU's
MMIO exits do the descriptor work in-context, paying for it once per
exit but never waiting on a held lock.

Implementation:

  src/devices/virtio_net.rs
    - new field pending_rx: Arc<crossbeam_queue::SegQueue<Vec<u8>>>
    - pending_rx() accessor returns a clone of the Arc
    - slirp_arc() exposes the backend Arc for direct net-poll access
    - new method flush_pending_rx(&mut self, mem) drains the SegQueue
      and writes RX descriptors using the same loop as try_inject_rx
    - try_inject_rx is now a thin wrapper that calls a new shared
      helper write_frames_to_rx_ring; same behaviour, structured
      so flush_pending_rx can share the descriptor-writing logic.

  src/vmm/mod.rs::net_poll_thread
    - Cache pending_rx + slirp Arcs once at thread startup; never
      touch the VirtioNetDevice mutex on the per-cycle path.
    - Drain backend frames into a reusable Vec, wrap each with a
      virtio-net header, push to the SegQueue, then pulse the IRQ.

  src/vmm/cpu.rs (MMIO dispatch)
    - Call guard.flush_pending_rx(guest_memory) at the top of the
      virtio-net MMIO read AND write handlers.  Materialises any
      frames the net-poll thread queued since the last MMIO exit.

Adds: crossbeam-queue = "0.3".

Effect on the single-process CRR p50 (mean over 5 runs of 30
iterations, voidbox+voidbox-SLIRP):

  before this commit:    ~278 us p50
  after this commit:     ~265 us p50   (~5% reduction)

Modest improvement on the single-vCPU benchmark we have available --
the win is mostly architectural (eliminates a contention point that
will become more meaningful with multi-vCPU guests, higher pps, and
parallel TX/RX paths).

Cumulative across the whole perf series:

  baseline:              421 us p50
  + virtio-net cleanups: ~380 us p50
  + KVM_IRQFD:           ~335 us p50
  + KVM_IOEVENTFD:       ~278 us p50
  + Option B SegQueue:   ~265 us p50  (~37% cumulative)

Voidbox's gap to qemu+passt (163 us) is now ~1.6x.

Wraps the device's interrupt_status register in Arc<AtomicU32> so the
net-poll thread can read and update it without taking the device
mutex.  Three concrete benefits:

  1. has_pending_interrupt() is now a single relaxed atomic load on
     &self -- safe to call from any thread, no lock, no contention.
  2. The net-poll thread caches a clone of the Arc at startup and
     uses it directly for its idle-cycle 'do I need to pulse the IRQ?'
     check, removing one mutex acquisition per cycle.
  3. interrupt_status |= 1 (set by RX inject) and interrupt_status &=
     !value (cleared by guest's INTERRUPT_ACK MMIO write) are now
     fetch_or / fetch_and atomic operations -- no read-modify-write
     race between the vCPU thread and the net-poll thread.
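
A sketch of the three access patterns (orderings illustrative; the
`ack_value` name is a placeholder for the guest's ACK write):

    use std::sync::Arc;
    use std::sync::atomic::{AtomicU32, Ordering};

    let interrupt_status = Arc::new(AtomicU32::new(0));

    // RX inject (net-poll thread)
    interrupt_status.fetch_or(1, Ordering::SeqCst);
    // guest INTERRUPT_ACK (vCPU thread)
    interrupt_status.fetch_and(!ack_value, Ordering::SeqCst);
    // idle-cycle check: lock-free, callable from any thread
    let pending = interrupt_status.load(Ordering::Relaxed) != 0;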

The vCPU thread's MMIO read of INTERRUPT_STATUS still goes through
the device mutex via the existing dispatcher, but the underlying
operation is now a pure atomic load -- a follow-up that lets the
dispatcher skip the lock for read-only MMIO accesses gets a cleaner
path because the field no longer needs synchronisation through the
mutex.

Single-vCPU CRR is within sample noise of the previous measurement
(~265 us p50 -> ~289 us across 5 runs of 30 iterations); the win is
mostly architectural rather than measurable on this workload.  Real
benefit shows up with multi-vCPU guests, higher pps, or workloads
where the net-poll and vCPU threads contend more aggressively.

Collects the SLIRP-vs-SLIRP / vs-pasta diagnostic tooling under one
directory.  Five files relocate, no behaviour change:

  scripts/bench-pasta.py          -> tools/perf-harness/bench-pasta.py
  scripts/bench-compare-pasta.py  -> tools/perf-harness/bench-compare-pasta.py
  scripts/bench-qemu-slirp.sh     -> tools/perf-harness/bench-qemu-slirp.sh
  tools/crr-client.c              -> tools/perf-harness/crr-client.c
  tools/qemu-init.sh              -> tools/perf-harness/qemu-init.sh

Updates path references in:
  - bench-qemu-slirp.sh (uses $SCRIPT_DIR for qemu-init.sh location;
    updated busybox extraction to climb two dirs up to repo root)
  - examples/crr_singleproc_bench.rs (doc + error message paths)
  - docs/passt-comparison.md (usage examples + extended example block
    that now also covers bench-qemu-slirp.sh and crr_singleproc_bench)

Smoke-tested after the move:
  - tools/perf-harness/bench-pasta.py --iterations 1 ...   passes
  - tools/perf-harness/bench-qemu-slirp.sh --backend libslirp passes

Eight follow-up fixes from PR #81 review:

src/vmm/mod.rs:
  Extract `setup_tx_notify_ioeventfd` helper and gate the entire
  IOEVENTFD path on `epoll_arc.is_some()`.  Fixes the original safety
  concern: the previous code registered KVM_IOEVENTFD even when no
  epoll dispatcher was available, which would have left guest TX
  notifies trapped in-kernel with no userspace drain — a silent hang.
  The helper rolls back the epoll registration if KVM_IOEVENTFD
  registration fails, so the two halves succeed or fail together.

examples/crr_singleproc_bench.rs:
  Switch the host-side accept thread to non-blocking accept with a
  deadline check so the example never hangs forever if the guest
  fails to connect.  Initial Copilot suggestion of a 2 ms sleep
  inflated each guest CRR sample by ~1.8 ms (sleep latency directly
  added to per-iter accept-pickup time).  Reduced to 50 µs to keep
  the sample noise below the metric resolution.

tools/perf-harness/bench-pasta.py:
  - `detect_host_gateway` now parses the route line by `via` keyword
    instead of indexing parts[2], so non-standard route formats
    don't silently pick up the wrong field.
  - CRR timer started before `srv.accept()` to match the
    voidbox-network-bench `crr_echo_server` semantics.

tools/perf-harness/bench-qemu-slirp.sh:
  - Replace `time.sleep(60)` with `threading.Event().wait()` so the
    host echo server stays alive for the entire qemu run instead of
    timing out at 60 s.
  - Add fail-fast bind error handling so port collisions surface
    immediately instead of producing a confusing "no result" later.

tools/perf-harness/qemu-init.sh:
  Derive the netmask from the CIDR prefix instead of hardcoding
  255.255.255.0, so non-/24 networks work.

tools/perf-harness/bench-compare-pasta.py:
  Remove unused `sign` variable.

docs/passt-comparison.md:
  Update path reference from `scripts/` to `tools/perf-harness/`.

Verified: voidbox single-process CRR p50 stays at ~280-310 µs
(within noise of pre-fix baseline) and `cargo test --test
network_baseline` passes 24/24.

Replace `std::mem::take(&mut *queue)` with an in-place
`extend_from_slice` + `clear()` against a scratch Vec owned by
`SlirpBackend`.  The previous pattern moved the queue's allocation
out and left a fresh `Vec::new()` (cap=0) behind, forcing the next
`push_ready_events` to grow the vector from cap=0 via
`extend_from_slice` every cycle.
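
The shape of the fix, sketched with a placeholder event type (the
real field lives on SlirpBackend as described above):

    use std::sync::Mutex;

    struct SlirpBackend {
        pending_events: Mutex<Vec<u64>>, // event type sketched as u64
        ready_scratch: Vec<u64>,
    }

    impl SlirpBackend {
        fn drain_ready(&mut self) {
            // before: std::mem::take(&mut *q) moved the allocation
            // out, leaving cap=0 for the next push_ready_events
            let mut q = self.pending_events.lock().unwrap();
            self.ready_scratch.clear();
            self.ready_scratch.extend_from_slice(&q);
            q.clear(); // both Vecs keep their capacity across cycles
        }
    }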

Heaptrack on the single-process CRR bench (30 iters) measured
this single callsite as ~half of all allocations during the run:

  before:  push_ready_events  4843 allocs  (49% of total)
           drain_to_guest     4776 allocs  (48% of total)
           total              12618 allocs

  after:   push_ready_events  gone from top callers
           drain_to_guest     3957 allocs  (still hot, downstream)
           total              6885 allocs  (-45%)

p50 CRR latency is unchanged (~270 µs); the wall-clock floor is
elsewhere on this workload.  The win is reduced allocator churn
(less jitter on bulk paths, fewer slow-path mallocs under sustained
load) — visible in the throughput bench rather than the CRR
microbench.

The `pending_events` Mutex<Vec> is also pre-sized to
`EVENTS_PRESIZE = 128` at construction so the very first push
doesn't reallocate.

The SLIRP backend's per-second new-connection rate limit
(`max_connections_per_second`, default 50/s) and concurrent-
connection ceiling (`max_concurrent_connections`, default 64) are
production anti-DoS defaults baked into `LocalSandbox`.  They are
hostile to microbenches that intentionally open hundreds of
connections in a tight loop — at 51 connects/s the limiter starts
returning RST to the guest, which crr-client sees as
`ECONNREFUSED` on its very next connect and exits with rc=3.

Reproduced as the "100-iter failure" in `crr_singleproc_bench`:
30 iters worked, 60 iters did not; the threshold was the 50/s
limit, not anything in the network stack itself.

Surface the two ceilings on `Sandbox::local()` as builder methods:

    .network_max_connections_per_second(u32::MAX)
    .network_max_concurrent_connections(usize::MAX)

`None` keeps the production defaults, so this is purely additive.
The bench now uses both.  500-iter run reproduces clean
(p50 268 µs, p99 1.6 ms, host accepts 500/500).

Both `flush_pending_rx` and `try_inject_rx` previously built a
fresh `Vec<Vec<u8>>` on every MMIO exit and handed it to
`write_frames_to_rx_ring`, which consumed it by value.  The
pattern dropped the outer-Vec allocation and forced the next call
to grow it from cap=0 — heaptrack on the CRR microbench measured
the flush_pending_rx site at 173 calls / 108 MB peak, the largest
remaining alloc consumer after the SLIRP `ready_scratch` fix.

`write_frames_to_rx_ring` now takes `&mut Vec<Vec<u8>>` and drains
in place via `drain(..)` / `append`, so callers reuse a long-lived
scratch buffer:

  - `flush_pending_rx` uses a new `flush_scratch` field on
    `VirtioNetDevice`, populated from `pending_rx` (SegQueue) and
    cleared at end.
  - `try_inject_rx` reuses the existing `rx_scratch` field that
    was already paired with `get_rx_frames`; the trailing
    `mem::take` in `get_rx_frames` is now followed by a
    `clear()` + restore at the end of `try_inject_rx`, so the
    capacity persists across the round-trip.
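
The signature change in sketch form (descriptor writes elided):

    // before: fn write_frames_to_rx_ring(frames: Vec<Vec<u8>>)
    // consumed the outer Vec by value, dropping it every MMIO exit.
    fn write_frames_to_rx_ring(frames: &mut Vec<Vec<u8>>) {
        for frame in frames.drain(..) {
            // ... RX descriptor-chain writes for `frame` ...
            let _ = frame;
        }
        // caller's scratch Vec returns empty but keeps its capacity
    }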

Heaptrack on 100-iter CRR:

  before this commit:  6885 allocs / 30 iters  = 229/iter
  after this commit:  18926 allocs / 100 iters = 189/iter

Aggregate from the original baseline:

  baseline (before all fixes): ~421 allocs/iter
  this commit:                 ~189 allocs/iter   (-55%)

p50 latency unchanged at ~275 µs as expected — alloc reduction
shows up in throughput and tail-latency stability, not the CRR
floor.

`relay_tcp_nat_data` builds a temporary `Vec<Vec<u8>>` per call
because the relay can't push directly to `inject_to_guest` while
iterating `flow_table` (both are `&mut self`).  The previous
pattern allocated a fresh `Vec::new()` every cycle, which
heaptrack flagged as the biggest remaining contributor inside
`drain_to_guest`'s call tree after the prior `ready_scratch`
and `flush_scratch` fixes.

Move the buffer onto `SlirpBackend` as `relay_frames_scratch`
and use the standard `mem::take` → process → restore pattern so
the buffer's capacity persists across `drain_to_guest` calls.
The two trailing `inject_to_guest.append(&mut frames_to_inject)`
sites already preserve capacity (Vec::append leaves the source
empty but with its allocation intact); only the entry-point
`Vec::new()` was discarding work.
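
The pattern in sketch form (`collect_relay_frames` is a hypothetical
stand-in for the borrow-conflicting flow_table iteration):

    // take: the field briefly holds an empty Vec
    let mut frames = std::mem::take(&mut self.relay_frames_scratch);
    // process: fill `frames` while flow_table is mutably borrowed
    collect_relay_frames(&mut self.flow_table, &mut frames);
    // append() empties `frames` but leaves its allocation intact
    self.inject_to_guest.append(&mut frames);
    // restore: capacity survives to the next drain_to_guest cycle
    self.relay_frames_scratch = frames;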

Cumulative impact on the 100-iter CRR microbench:

  baseline (before any of these fixes):  ~421 allocs/iter
  after ready_scratch + flush_scratch:    ~189 allocs/iter
  after relay_frames_scratch (this PR):    ~93 allocs/iter (-78%)

p50 latency continues at ~275 µs; the floor is dominated by
KVM-exit / wakeup costs, not allocator churn.  The win shows up
under sustained load where reduced allocator pressure improves
tail-latency stability and per-frame jitter.

Three of the relay functions called from `drain_to_guest`
(`relay_tcp_nat_data`, `relay_icmp_echo`, `relay_udp_flows`)
each built a per-call `Vec<FlowKey>` to side-step the
`&mut self` / `flow_table` borrow conflict.  The Vecs were
allocated, populated, drained, and dropped on every cycle.
The UDP relay built two — one for the stale-sweep, one for the
readiness loop.

Add a single `flow_keys_scratch: Vec<FlowKey>` field on
`SlirpBackend` and rotate it through all four sites with the
mem::take → process → restore pattern (the relays run
sequentially inside `drain_to_guest`, so one buffer suffices).
Each iteration uses `Vec::drain(..)` instead of for-by-value so
capacity is preserved across the consume.

Heaptrack on the 100-iter CRR microbench:

  before this commit:   9296 allocs (~93/iter)
  after this commit:    4103 allocs (~41/iter)
  temporary allocs:     5546 → 574  (-90%)

Cumulative from the original baseline (start of this round):

  ~421 allocs/iter → ~41 allocs/iter   (-90%)

p50 latency unchanged at ~275 µs as predicted; the wall-clock
floor is dominated by KVM exits / vCPU wakeups.  The gain shows
up as reduced allocator pressure on bulk paths and fewer
slow-path mallocs under sustained load.

Top remaining alloc callsites are now per-frame `Vec<u8>` from
`build_tcp_packet_static` (one allocation per TCP frame) and
TX queue frame parsing — both intrinsic to the protocol shape;
further reduction needs a pool/arena, not a scratch hoist.

Same fix as `crr_singleproc_bench`: the bench's CRR phase opens
30 connections in <1s, which trips the production SLIRP rate
limiter (50 conn/s) and surfaces as a 2 s "crr echo channel
receive error" instead of a real number.

Use the new `Sandbox::local()` rate-limit knobs to lift both
ceilings (max_connections_per_second + max_concurrent_connections)
explicitly.  Production sandboxes are unaffected — the lift is
opt-in.

Plan doc for the next perf round.  With #81's user-space alloc
reductions exhausted (-90% allocs/iter, p50 unchanged), the
remaining floor is kernel↔userspace transitions, MMIO exits, and
single-queue serialization.

Three experiments in scope, ranked by risk × payoff:

  1. io_uring for SLIRP host-socket I/O  — start here
  2. splice() / sendfile() zero-copy on bulk paths
  3. MSI-X virtio + multi-queue for vCPU scaling

Non-goal: TAP + passt-style host bypass.  Routing through an
external passt would close the latency gap to passt but moves the
DNS interception, port-forwarding, deny-list, and rate-limiting
feature surface out of voidbox — and loses the in-process
observability we currently get from instrumenting SLIRP directly.
Full SLIRP-path observability is a hard requirement.

Each experiment lands as its own commit, gated behind a Cargo
feature so the #81 baseline can A/B against it without a revert.
Measurements use the harness shipped in #81.

Base automatically changed from passt-comparison-harness to main May 7, 2026 00:37

dpsoft added 2 commits May 6, 2026 21:42

First commit on the architectural-experiments branch (#83).
Adds a `UringBatch` wrapper around `io_uring::IoUring` with the
submit / drain shape the SLIRP relay will use to batch host-socket
recv / send into single `io_uring_enter` round-trips.

Key shape:

  - One `UringBatch` is single-owner: the SLIRP `net_poll_thread`
    constructs and drives one.  No locking, no cross-thread
    sharing.
  - SQEs are tagged with `(UringOp, correlation_id)` packed into
    `user_data` so the completion drain routes a CQE back to
    its originating flow without a side table.  Low 32 bits =
    correlation id, top 32 bits = op tag.
  - `submit_recv` / `submit_send` are `unsafe` because the kernel
    references the user buffer asynchronously; the caller's
    safety contract requires `buf` to outlive the matching CQE.
  - The existing `EpollDispatch` keeps owning the readiness
    signal — io_uring replaces only the data-plane syscalls,
    not the wake-up.  Two layers stay separable so the feature
    can be toggled off without touching the relay state machine.
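
The packing in sketch form (`UringOp` variants illustrative):

    #[repr(u32)]
    #[derive(Clone, Copy)]
    enum UringOp { Recv = 0, Send = 1 }

    fn pack_user_data(op: UringOp, correlation_id: u32) -> u64 {
        ((op as u64) << 32) | u64::from(correlation_id)
    }

    fn unpack_user_data(ud: u64) -> (u32, u32) {
        ((ud >> 32) as u32, ud as u32) // (op tag, correlation id)
    }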

Behavior unchanged: nothing wires this in yet.  Cargo feature
`io-uring` (off by default) gates both the new module and the
`io-uring = "0.7"` dependency.  Module is `#![allow(dead_code)]`
for now; the next commit on this branch wires the relay TCP
recv / send paths through it and removes the allow.

Tests:

  - 4 unit tests in `src/network/uring.rs` cover user-data round
    trip + a real `submit_send` -> `submit_recv` cycle across a
    `socketpair` (skipped on kernels without io_uring).
  - `cargo test --features io-uring --lib`:  381 passed.
  - `cargo test --test network_baseline` (default features): 24/24.
  - `cargo clippy --all-targets [-- -D warnings]` clean both with
    and without the feature.

Methodology per `docs/perf-architectural-experiments.md`:
each experiment lands as one feature-gated commit so the #81
baseline can A/B against it without a revert.  This is the
infrastructure commit; the next one wires + measures.

Companion to `crr_singleproc_bench`: drives M concurrent
crr-client processes in the same guest so the SLIRP relay sees
N>1 ready flows per `net_poll_thread` cycle.  The single-flow
microbench can't see io_uring batching or multi-queue wins
because there's nothing to batch / parallelize with one ready
flow at a time; this bench is the workload the architectural
experiments on this branch (#83) need.

Per-flow `crr-client` writes its summary line to its own
`/tmp/crr_results/$i.txt`; the trailing shell loop concatenates
all M lines for the host to parse.  Aggregation reports
median-of-p50s, max p99, mean-of-means, and aggregate qps.

Note: busybox-static lacks `seq`, so the flow-id list is
materialized on the host and inlined into the shell command.

## Baseline (this branch's tip = #81 + io_uring scaffold)

Single net_poll_thread, no architectural changes wired:

| M | Median p50 | Max p99 | Aggregate qps |
|---|-----------:|--------:|--------------:|
| 1 |     275 µs |   ~2 ms |        ~3636  |
| 2 |     473 µs | 12.9 ms |         2173  |
| 4 |     732 µs | 13.2 ms |         2370  |
| 8 |    2043 µs | 14.5 ms |         2242  |

Reading:
  - Aggregate qps saturates at ~2200-2400 regardless of M —
    the single net_poll_thread is the bottleneck.
  - Per-flow p50 grows ~linearly with M (M=8 each flow takes
    7.4× the M=1 p50).
  - p99 jumps to 12-14 ms at M=2 already; tail-latency is
    dominated by per-flow head-of-line blocking through the
    single epoll loop.

This is exactly the workload io_uring batching, splice, and
multi-queue should move.  The io_uring wiring lands in the
next commit on this branch with measurements against this
table.